fix(go-services): mysql healthcheck false-healthy race (kafka-ecommerce CI flake) #14
Open
slayerjain wants to merge 1 commit into
…it server

The mysql-users/products/orders healthchecks used `mysqladmin ping -h localhost`, which hits MySQL 8.0's socket-only TEMPORARY init server (run to apply the seed db.sql before the real server starts) and passes on exit code only: it returns 0 even on "Access denied". So docker marked the container healthy ~3s before the real :3306 TCP listener was up. A service depending on it via `condition: service_healthy` then connected over TCP too early and failed during the temp-server -> real-server restart gap; docker's subsequent probes against the now-stopped temp server drove the failing streak to the retry limit -> "container unhealthy" -> "dependency failed to start". This is the intermittent kafka-ecommerce CI flake (it passes whenever timing happens to favour it).

Fix: probe the REAL TCP listener with root creds (`mysqladmin ping -h 127.0.0.1 -P 3306 -uroot -proot`), since only the fully-started real server answers there, never the temp server, and add `start_period: 60s` so slow cold init under CI contention doesn't burn the retry budget before :3306 is up. Applied to all three mysql services.

Signed-off-by: Shubham Jain <shubham@keploy.io>
What

Fix the intermittent mysql-users/products/orders "container unhealthy" → "dependency failed to start" flake (seen in keploy/enterprise's `kafka-ecommerce` CI lane, which clones this repo's `go` branch and runs `go-services` under `keploy record`).

Root cause (traced)

The mysql healthchecks used `mysqladmin ping -h localhost`. MySQL 8.0's entrypoint first runs a temporary, socket-only init server (to apply the seed `db.sql`), then stops it and starts the real server on `:3306`. `ping -h localhost` hits that unix socket and checks exit code only, and `mysqladmin` exits `0` even on "Access denied". A per-second trace caught the probe reporting `mysqld is alive` at t=8s against the temp server, ~3s before the real `:3306` listener came up (t=11s).

So docker marked the container healthy on the temp server. A dependent service (`condition: service_healthy`) that connects over TCP to `mysql-users:3306` then started too early and failed during the temp→real restart gap (~4-6s, wider under load); docker's next probes hit the now-stopped temp server, driving the failing streak to the 20-retry limit → `unhealthy`. There was also no `start_period`, so cold-init failures under CI contention burned the retry budget.

Fix

Probe the real TCP listener with root creds, `mysqladmin ping -h 127.0.0.1 -P 3306 -uroot -proot` (only the fully-started real server answers on `:3306`, never the temp server), and add `start_period: 60s` so slow cold init doesn't count against the retries. Applied to all three mysql services.

Validation

- `docker compose config` validates.
- With the old probe: `Access denied … (using password: NO)`, i.e. healthy without a real connection, against the temp server.
- With the new probe: `FailingStreak` stays 0 through cold init (absorbed by `start_period`); healthy only at t=16s with `mysqld is alive` over real TCP `:3306`, never the temp server. 5/5 cold-start loops healthy.

Only the three mysql healthchecks change (+30/−3); no service/app changes.
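
For reviewers who want to see the shape of the change without opening the diff, here is a minimal sketch of what the updated healthcheck might look like for one of the three services. The probe command, `start_period: 60s`, the 20-retry limit, and the `mysql-users` service name come from the description above; the image tag, the `MYSQL_ROOT_PASSWORD` setting, `interval`, `timeout`, and the `users-service` dependent are illustrative assumptions, not values copied from this PR's diff.

```yaml
services:
  mysql-users:
    image: mysql:8.0                # assumed tag; the PR only says "MySQL 8.0"
    environment:
      MYSQL_ROOT_PASSWORD: root     # assumed; inferred from -proot in the probe
    healthcheck:
      # -h 127.0.0.1 forces a TCP connection, so the socket-only temporary
      # init server cannot answer; only the real server on :3306 can.
      test: ["CMD", "mysqladmin", "ping", "-h", "127.0.0.1", "-P", "3306", "-uroot", "-proot"]
      interval: 2s                  # assumed value
      timeout: 5s                   # assumed value
      retries: 20                   # the 20-retry limit mentioned above
      start_period: 60s             # failures in this window do not count against retries

  users-service:                    # hypothetical dependent service
    build: ./users                  # hypothetical path
    depends_on:
      mysql-users:
        condition: service_healthy  # waits for the probe above to pass
```

Failed probes during `start_period` are not counted toward `retries`, while a successful probe in that window still marks the container healthy right away, so the grace period should not delay startup when MySQL comes up quickly.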